# AEP-8515: Add support for setting a custom request-to-limit ratio at the VPA object level #8516
## Conversation
Adding the `do-not-merge/release-note-label-needed` label because no release-note block was detected. Please follow our release note process to remove it. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Hi @iamzili. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Thanks for the AEP!
I've left 2 small comments, but will definitely need to come back for a more thorough review.
> * **Factor**: Multiplies the recommended request by a specified value, and the result is set as the new limit.
>   * Example: if `factor` is set to `2`, the limit will be set to twice the recommended request.
> * **Quantity**: Adds a buffer on top of the resource request. This can be expressed either:
>   * As a **percentage** (`QuantityPercentage`), or
>   * As an **absolute value** (`QuantityValue`).
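To illustrate, here is a sketch only, using the `Type`/`Value` shape that appears later in this thread (these are the names proposed at this stage of the AEP, not a final API):

```yaml
RequestToLimitRatio:
  cpu:
    Type: Factor
    Value: 2               # CPU limit = 2x the recommended request
  memory:
    Type: QuantityPercentage
    Value: 10              # memory limit = recommended request + 10%
```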
Would QuantityPercentage not be the same as factor?
ie: QuantityPercentage of 10% is the same as factor of 1.1? (assuming that factor wasn't stored as an int)
Good point - I was thinking the same: simplify things by dropping `QuantityPercentage` and renaming `QuantityValue` to `Quantity`.
When I first thought about this feature, I thought it might be useful to give users more options to define the ratio's magnitude. But this will just make VPA more complex.
I'm also considering changing the way the sub-fields are defined, based on AEP-7862, to make the VPA CRD more consistent. For example:
from:

```yaml
RequestToLimitRatio:
  cpu:
    Type: Factor
    Value: 2
  memory:
    Type: Quantity
    Value: 100Mi
```

to:

```yaml
RequestToLimitRatio:
  cpu:
    factor: 2
  memory:
    quantity: 100Mi
```
btw after reviewing AEP-7862, I think consistency is broken there when `startupBoost` is defined under `containerPolicies`:

```yaml
startupBoost:
  cpu:
    type: "Quantity"
    quantity: "4"
```

vs

```yaml
startupBoost:
  cpu:
    factor: 1
```
We should define a common specification to be used for both AEPs.
Yup, I think this change makes sense. And good callout on the inconsistency in the startup boost feature.
Wait, hold on, where is the consistency broken? `type` defaults to `Factor`.
So the second example you pasted implies the following:

```yaml
startupBoost:
  cpu:
    type: "Factor"
    factor: 1
```
This is consistent with the other methods, unless I'm missing something.
you're right! I missed the defaulting part from AEP-7862.
But the consistency between the AEP I created and AEP-7862 is still broken because:

from `startupBoost`:

```yaml
Type: "Factor"
Factor: 1
```

from `RequestToLimitRatio`:

```yaml
Type: "Factor"
Value: 1 # do I need to rename the key to "Factor"?
```

If you ask me, the `startupBoost` spec is more aligned with the VPA object. Since the `startupBoost` AEP is already merged into the main branch, I assume we need to follow the same approach in the `RequestToLimitRatio` AEP.
Correct, there should be consistency between this AEP and the startupBoost AEP
consistency fixed
> #### When Enabled
>
> * The admission controller will **accept** new VPA objects that include a configured `RequestToLimitRatio`.
> * For containers targeted by a VPA object using `RequestToLimitRatio`, the admission controller and/or the updater will enforce the configured ratio.
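For illustration, a minimal sketch of a VPA object such a controller would accept, assuming `requestToLimitRatio` sits under `resourcePolicy.containerPolicies` alongside fields like `controlledValues` (the container name and exact placement are assumptions, not fixed by this excerpt):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa          # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app            # hypothetical workload
  resourcePolicy:
    containerPolicies:
    - containerName: my-app
      controlledValues: RequestsAndLimits  # limits must be controlled for a ratio to apply
      requestToLimitRatio:
        cpu:
          factor: 2         # CPU limit = 2x the recommended request
        memory:
          quantity: 200Mi   # memory limit = recommended request + 200Mi
```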
What happens in a scenario like this:

- A user has a Pod with in-place updates enabled, where requests equal limits (so the QoS class is Guaranteed)
- The user modifies the VPA to add a RequestToLimitRatio, making limits double the requests

What should the behaviour be? Should the VPA kill the Pod to allow a new one to be created with a different QoS class?
I totally forgot about QoS classes, thanks for bringing this up!
Since the `qosClass` field is immutable, we cannot rely on in-place updates; the Pod(s) need to be recreated if the QoS class changes.
To clarify, changing the ratio using this new feature will not evict the Pod(s) immediately. It will rely on the current Updater's behavior, such as when the current `resources.requests` differ significantly from the new recommendation.
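To make the QoS interaction concrete, here is a sketch (illustrative values) of why a ratio change like this cannot happen in place:

```yaml
# Before: requests == limits for all resources -> QoS class "Guaranteed"
resources:
  requests:
    cpu: 500m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 256Mi
---
# After a request-to-limit factor of 2: limits > requests -> QoS class "Burstable".
# A Pod's QoS class is immutable, so this transition requires recreating the Pod.
resources:
  requests:
    cpu: 500m
    memory: 256Mi
  limits:
    cpu: "1"
    memory: 512Mi
```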
I guess my concern is actually more of "how will the VPA handle the RequestToLimitRatio changing". Either from the default to a specified ratio, or from one ratio to another.
I think it's worth calling this out in the AEP
added more examples
Thanks for tagging me in Slack to take a look, and for this AEP!
I added a few questions and ideas.
> ### Test Plan
>
> * Implement comprehensive unit tests to cover all new functionality.
> * e2e tests: TODO
Please test extremely high factors like 1 million in addition to more typical factors like 1, 5, 10.
why do we need to test something like this?
We don't necessarily need to, given the response above that limits aren't used for scheduling, but it might turn up something interesting in how Kubernetes handles it. You never know what bug might lurk with limits that have never been tested before.
@tspearconquest, what do you want to test here? If you want to test a pod with a factor of 1 million, that is unrelated to the VPA work. In tests we need to make sure the VPA behaves as planned.
> what do you want to test here?

Integer overflows on the recommender status fields.

> if you wanna test a pod with 1 million factor this is unrelated to the VPA work.

Okay, no problem.

> In tests we need to make sure the VPA behaves as planned.

No argument here. I'm probably overthinking this, as I don't know Go code very well and don't have the level of understanding you all do with regard to how the recommender status fields and the VPA pods would work with a value exceeding the int64 limit of ~8 exbibytes when converted to an integer.
Yeah, valid points. In the HPA we do the same tests, but as unit tests.
I guess we can add some unit tests around that as well.
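For example, such a unit test could feed the ratio logic an extreme factor and assert that the computed limit neither overflows nor produces an invalid quantity. A hypothetical fixture (field names taken from this AEP) might look like:

```yaml
# Hypothetical test fixture: an extreme factor to probe quantity
# arithmetic for integer overflow or invalid results.
requestToLimitRatio:
  cpu:
    factor: 1000000   # far beyond typical values like 1, 5, or 10
  memory:
    factor: 1000000   # e.g. a 1Gi recommendation implies a ~1Pi limit
```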
I suggest we first agree on this AEP and then we can mark this for an api-review.

Is this step for the core maintainers, or can I help with it to move things forward somehow?

Once the maintainers are OK with that, we will move this to the api-machinery folks for review.
> ### Test Plan
>
> * Implement comprehensive unit tests to cover all new functionality.
> * e2e tests: TODO
Instead of leaving this as a TODO, please add some coverage here. The E2E test should be straightforward - just verify the limit across a few different scenarios. A couple of simple cases would be enough.
Done, added some e2e tests.
/ok-to-test
```yaml
RequestToLimitRatio:
  cpu:
    type: Factor # this field is optional, if omitted it defaults to "Factor"
    factor: 2
  memory:
    type: Quantity
    quantity: 200Mi
```
Why do we need to specify the type if we already have it in the value itself?
Meaning - why can't we do:

```yaml
RequestToLimitRatio:
  cpu:
    factor: 2
  memory:
    quantity: 200Mi
```

Why is the `type` necessary?
Yeah, I was considering dropping the `type` field entirely, but I followed the specs defined in AEP-7862 to ensure consistency in the VPA CRD. My original spec for `RequestToLimitRatio` was different.
In AEP-7862, the `type` defaults to `Factor`, as in this AEP. That said, I agree with you that it might make sense to drop the `type` completely in this AEP.
So like in AEP-7862 we can see the following examples:

```yaml
resourcePolicy:
  containerPolicies:
  - containerName: "boosted-container-name"
    mode: "Auto" # Vanilla VPA mode + Startup Boost
    minAllowed:
      cpu: "250m"
      memory: "100Mi"
    maxAllowed:
      cpu: "500m"
      memory: "600Mi"
    # The CPU boosted resources can go beyond maxAllowed.
    startupBoost:
      cpu:
        value: 4
```

or

```yaml
resourcePolicy:
  containerPolicies:
  - containerName: "boosted-container-name"
    mode: "StartupBoostOnly"
    startupBoost:
      cpu:
        factor: 2.0
```

(These were taken from the AEP proposal.)
So the factor/value field should be changed, and the type field can be dropped.
Where do you see those examples?
Over in https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler/enhancements/7862-cpu-startup-boost#api-changes, I see:

> If not specified, StartupBoost.CPU.Type defaults to Factor

Something I like about having the `Type` is that we could one day extend it to `MaxOf` (or something similar) to select the max of the quantity or the calculated factor.
For example, a factor of 1.1 on a small 100MiB pod only gives you an extra 10MiB, which is very small.
But maybe in the future you want to have a factor of 1.1 and a quantity of 100MiB; that way at the low end you get a 100MiB buffer, and at the high end you get 10%.
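A purely hypothetical sketch of what such a future `MaxOf` type could look like (not part of this AEP; it only illustrates why keeping `type` leaves room to grow):

```yaml
requestToLimitRatio:
  memory:
    type: MaxOf       # hypothetical future type, not proposed in this AEP
    factor: 1.1       # at the high end: limit = request + 10%
    quantity: 100Mi   # at the low end: guarantees at least a 100Mi buffer
```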
Oh, I was looking at the wrong branch (sorry about that).
Yeah, I agree, having something like `MaxOf` can be useful indeed - let's keep it that way then :)
Overall seems ok to me.

Agreed!

/api review

/label api-review
> A new `RequestToLimitRatio` field will be added, with the following sub-fields:
>
> * [Optional] `RequestToLimitRatio.CPU.Type` or `RequestToLimitRatio.Memory.Type` (type `string`): Specifies how to apply limits proportionally to the requests. `Type` can have the following values:
To be in line with the CPU Startup Boost feature, this needs to be a required field. See #8608
(Sorry for changing it out from under you)
no problem, updated
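With `type` required, the earlier example would spell the type out in both entries; a sketch of the resulting shape:

```yaml
RequestToLimitRatio:
  cpu:
    type: Factor      # now required, to match the CPU Startup Boost API (#8608)
    factor: 2
  memory:
    type: Quantity
    quantity: 200Mi
```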
[APPROVALNOTIFIER] This PR is **NOT APPROVED**

This pull-request has been approved by: iamzili

The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
> If the request-to-limit ratio needs to be updated (for example, because the application's resource usage has changed), users must modify the `resources.requests` or `resources.limits` fields in the workload's API object. Since these fields are immutable, this change results in terminating and recreating the existing Pods.
>
> This proposal introduces a new mechanism that allows VPA users to adjust the request-to-limit ratio directly at the VPA CRD level for an already running workload. This avoids the need to manually update the workload's resource requests and limits, and prevents unnecessary Pod restarts.
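Concretely, under this proposal updating the ratio becomes an edit to the VPA object alone; a sketch, assuming the placement and field names discussed earlier in this thread:

```yaml
# Instead of editing the Deployment's resources.requests/limits
# (which recreates the Pods), the user edits only the VPA object:
resourcePolicy:
  containerPolicies:
  - containerName: "my-app"     # hypothetical container name
    requestToLimitRatio:
      cpu:
        factor: 2               # desired limit = 2x the recommended request
```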
I am curious about approaching this proposal from a slightly different perspective:

- Lean into the custom proportional scaling configuration part (evolve the current "... are scaled automatically" functionality that is an outcome of setting `ContainerControlledValues` to `RequestsAndLimits`).
  - In other words, this proposal gives users an option of using `ContainerControlledValues=RequestsAndLimits` in a non-automatic, user-configurable way.
- Don't worry about the "in-place vs recreate pod" challenge.

My thinking is based on the maturation of in-place pod resizing (scheduled for GA in 1.35). Is there a reason why we can't rely upon IPPR for handling the in-place objectives of this use case? We should be able to give folks who want an automatic `ContainerControlledValues=RequestsAndLimits` experience the option to trigger automatic proportionality while modifying the inputs at the original workload API object layer (e.g., pod requests/limits directly), and offer the option of in-place via configuring the workload to use IPPR.

@adrianmoisey @omerap12 @natasha41575 @tallclair do my thoughts above make sense?
> while modifying the inputs at the original workload API object layer (e.g., pod requests/limits directly), and offer the option of in-place via configuring the workload to use IPPR

Please correct me if I've misunderstood, but is your suggestion to relax the immutability of pod requests/limits at the workload level and support IPPR for workloads? There's an open issue for that support: kubernetes/kubernetes#132436. I'm a little hesitant to promise this though, because of the complexity - if VPA or other top-level controllers can handle the use cases in a simpler way, that would be my preference.
ACK, thank you @natasha41575, that is indeed what I'm suggesting (and I hadn't realized that request/limits immutability was out of scope for IPPR thus far - my bad).

Given this current state of things, I would suggest that this AEP mention that there is an open issue to address workload request/limits immutability, and that this proposal may offer a temporary bridge for folks who want to leverage VPA to address that use case.

The tl;dr of my perspective is that I think this would be a more durable description of what we are proposing:

> This proposal introduces a new mechanism that allows VPA users to adjust the request-to-limit ratio directly at the VPA CRD level. This mechanism can also be used to update existing workload + VPA configurations, for example to non-disruptively scale workloads beyond what their requests/limits would predict.

(instead of stating "for an already running workload" as a base case)

Does that make sense?
**What type of PR is this?**

/kind documentation
/kind feature

**What this PR does / why we need it:**

Autoscaling Enhancement Proposal (AEP) to add support for setting a custom request-to-limit ratio at the VPA object level.

I'd love to hear your thoughts on this feature.

#8515